In this homework, you will be using the FIFA 2022 Dataset, which is a .csv where each row is a player in the FIFA 2022 video game. Each player is described by a variety of attributes, like crossing ability, stamina, etc. Each attribute is described by a "grade". The data contains the 500 most valuable players in the game.
The list of attributes is ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Dribbling', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Reactions', 'ShotPower', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Penalties', 'StandingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']. Consider these the features. You will need to use these for PCA and t-SNE.
Start by uploading the fifa.csv to the notebook. Then run the code below, which downloads the player photos.
If there is code that requires a random seed (for example, t-SNE) please set it to 2022.
### DO NOT EDIT! ###
import pandas as pd
import requests
import time
from tqdm import tqdm
fifa = pd.read_csv("fifa.csv") # Make sure you upload the csv
images = []
for i, row in tqdm(fifa.iterrows()):
    resp = requests.get(row["Photo"])
    with open(str(row["ID"]) + ".png", "wb") as f:
        f.write(resp.content)
    images.append(str(row["ID"]) + ".png")
500it [00:50, 9.80it/s]
import base64, io, IPython
from PIL import Image as PILImage
imgCode = []
for imgPath in images:
    image = PILImage.open(imgPath)
    output = io.BytesIO()
    image.save(output, format='PNG')
    encoded_string = "data:image/png;base64," + base64.b64encode(output.getvalue()).decode()
    imgCode.append(encoded_string)
fifa["image"] = imgCode
Create a new variable, Position, and group the following positions (found in the Best Position feature) together:
LB, RB, LWB, RWB - Wing Back
RW, LW, RM, LM - Winger
CAM, CDM, CM - Central Midfielder
CF, ST - Striker
CB - Central Defender
GK - Goalkeeper
Below is a diagram of the soccer positions and their groupings.
# code for step 1 here
Position = []
Pos_map = {
    "CF": "Striker",
    "ST": "Striker",
    "LW": "Winger",
    "RW": "Winger",
    "LM": "Winger",
    "RM": "Winger",
    "CAM": "Midfielder",
    "CM": "Midfielder",
    "CDM": "Midfielder",
    "LWB": "Wingback",
    "RWB": "Wingback",
    "LB": "Wingback",
    "RB": "Wingback",
    "CB": "Central Defender",
    "GK": "Goalkeeper"
}
for i, row in tqdm(fifa.iterrows()):
    Position.append(Pos_map[row["Best Position"]])
fifa["Position"] = Position
500it [00:00, 27850.62it/s]
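The same grouping can also be written without an explicit loop. The following is a minimal sketch, assuming a small stand-in DataFrame (`demo`) instead of the real fifa.csv; `pos_map_demo` is an abbreviated, hypothetical copy of the mapping used above.

```python
import pandas as pd

# Stand-in data; the real notebook would use the `fifa` DataFrame.
demo = pd.DataFrame({"Best Position": ["ST", "LW", "CDM", "RB", "CB", "GK"]})

# Abbreviated copy of the mapping, for illustration only.
pos_map_demo = {
    "ST": "Striker",
    "LW": "Winger",
    "CDM": "Midfielder",
    "RB": "Wingback",
    "CB": "Central Defender",
    "GK": "Goalkeeper",
}

# Series.map looks up each value in the dict; unmapped codes become NaN.
demo["Position"] = demo["Best Position"].map(pos_map_demo)
print(demo)
```

Unlike indexing the dict inside a loop, which raises a KeyError on an unknown position code, `map` silently produces NaN, so it is worth checking `demo["Position"].isna().sum()` afterwards.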
Using the list of attributes above, reduce the dimensionality of the data using 2 principal components.
Plot the first and second components using altair. Color each point based on its grouping from Step 1, and make sure the tooltip for each point contains the player name, position and photo (use the image feature created in the first two code blocks).
What groupings of players are most directly visible?
import altair as alt
# Prepare Data(X) and Labels(Y)
X = []
Y = []
Pos2Group = {"Striker" : 0, "Winger" : 1, "Midfielder" : 2, "Wingback" : 3, "Central Defender" : 4, "Goalkeeper" : 5}
for i, row in tqdm(fifa.iterrows()):
    X.append([row['Crossing'],
              row['Finishing'],
              row['HeadingAccuracy'],
              row['ShortPassing'],
              row['Dribbling'],
              row['FKAccuracy'],
              row['LongPassing'],
              row['BallControl'],
              row['Acceleration'],
              row['SprintSpeed'],
              row['Reactions'],
              row['ShotPower'],
              row['Stamina'],
              row['Strength'],
              row['LongShots'],
              row['Aggression'],
              row['Penalties'],
              row['StandingTackle'],
              row['GKDiving'],
              row['GKHandling'],
              row['GKKicking'],
              row['GKPositioning'],
              row['GKReflexes']
              ])
    Y.append(Pos2Group[Position[i]])
500it [00:00, 11658.10it/s]
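A feature matrix like `X` can also be obtained by selecting columns by name. This is only a sketch on a tiny stand-in DataFrame; `demo` and `feature_cols` are hypothetical names, and in the notebook `feature_cols` would hold the full list of 23 attributes.

```python
import pandas as pd

# Stand-in data with three of the attribute columns plus a non-feature column.
demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Crossing": [80, 55, 20],
    "Finishing": [90, 40, 15],
    "GKDiving": [10, 12, 88],
})

# In the notebook this would be the full list of 23 attributes.
feature_cols = ["Crossing", "Finishing", "GKDiving"]

# Selecting columns by name keeps the column order of `feature_cols`.
X_demo = demo[feature_cols].to_numpy()
print(X_demo.shape)
```

Keeping the attribute names in one list avoids repeating them for each subset of players (all players, non-goalkeepers, midfielders).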
from sklearn.decomposition import PCA
import pandas as pd
pca = PCA(n_components=2)
pca.fit(X)
X_trans = pca.transform(X)
print(pca.explained_variance_ratio_)
[0.56325308 0.19545388]
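To read the printed ratios: the two components together account for roughly 76% of the variance in the 23 attributes. A quick check of that arithmetic, using the values copied from the output above:

```python
import numpy as np

# Ratios copied from the output printed above.
ratios = np.array([0.56325308, 0.19545388])

# Cumulative share of variance captured by the first k components.
cumulative = np.cumsum(ratios)
print(cumulative)
```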
fifa["PCA_First"] = X_trans[:,0]
fifa["PCA_Second"] = X_trans[:,1]
alt.Chart(fifa, title="PCA FIFA").mark_point().encode(
    x="PCA_First",
    y="PCA_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
The goalkeepers are the most clearly visible group in this plot.
Now, remove the goalkeepers from the data set, and re-run PCA. What does the first principal component seem to indicate (as you scan over the range of the 1st principal component, what relationship do you see?)
# Drop GK from the dataset
drop_list = []
for i, row in tqdm(fifa.iterrows()):
    if row["Best Position"] == "GK":
        drop_list.append(i)
fifa_ng = fifa.drop(drop_list)
500it [00:00, 27850.25it/s]
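Filtering rows can also be done with a boolean mask instead of collecting indices to drop. A minimal sketch on a stand-in DataFrame (`demo` and `demo_ng` are hypothetical names, not part of the notebook):

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Best Position": ["ST", "GK", "CB", "GK"],
})

# Boolean mask: True for every row that is not a goalkeeper.
not_gk = demo["Best Position"] != "GK"

# .copy() gives an independent frame, so adding columns later is safe.
demo_ng = demo[not_gk].copy()
print(demo_ng)
```

As with `drop`, the surviving rows keep their original index labels, so the index is no longer a contiguous range.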
# Prepare Data(X_NG) and Labels(Y_NG)
X_NG = []
Position_NG = []
for i, row in tqdm(fifa_ng.iterrows()):
    X_NG.append([row['Crossing'],
                 row['Finishing'],
                 row['HeadingAccuracy'],
                 row['ShortPassing'],
                 row['Dribbling'],
                 row['FKAccuracy'],
                 row['LongPassing'],
                 row['BallControl'],
                 row['Acceleration'],
                 row['SprintSpeed'],
                 row['Reactions'],
                 row['ShotPower'],
                 row['Stamina'],
                 row['Strength'],
                 row['LongShots'],
                 row['Aggression'],
                 row['Penalties'],
                 row['StandingTackle'],
                 row['GKDiving'],
                 row['GKHandling'],
                 row['GKKicking'],
                 row['GKPositioning'],
                 row['GKReflexes']
                 ])
    Position_NG.append(Pos_map[row["Best Position"]])
pca_ng = PCA(n_components=2)
pca_ng.fit(X_NG)
X_NG_trans = pca_ng.transform(X_NG)
print(pca_ng.explained_variance_ratio_)
473it [00:00, 11312.23it/s]
[0.45826186 0.1585531 ]
fifa_ng["PCA_NG_First"] = X_NG_trans[:,0]
fifa_ng["PCA_NG_Second"] = X_NG_trans[:,1]
fifa_ng["Position"] = Position_NG
alt.Chart(fifa_ng, title="PCA FIFA (Non Goalkeeper)").mark_point().encode(
    x="PCA_NG_First",
    y="PCA_NG_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
The first principal component seems to indicate defensive ability, because the central defenders and wingbacks appear on the right side of the plot.
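One way to back up a reading like this is to look at the component's loadings, which scikit-learn exposes as `components_`. The sketch below uses synthetic data with hypothetical feature names ("Defending", "Attacking", "Unrelated"), not the FIFA attributes; it only illustrates how opposite-signed loadings show up when two features move in opposite directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2022)
n = 500
latent = rng.normal(size=n)  # hidden "defensive vs. attacking" factor

# Hypothetical feature names, not columns of fifa.csv.
names = ["Defending", "Attacking", "Unrelated"]
X_demo = np.column_stack([
    latent + 0.1 * rng.normal(size=n),   # rises with the latent factor
    -latent + 0.1 * rng.normal(size=n),  # falls with the latent factor
    0.1 * rng.normal(size=n),            # independent noise
])

pca_demo = PCA(n_components=2).fit(X_demo)

# Each row of components_ holds one component's weights on the features.
loadings = pca_demo.components_[0]
for name, weight in zip(names, loadings):
    print(f"{name}: {weight:+.2f}")
```

The overall sign of a principal component is arbitrary, so only the relative signs and magnitudes of the loadings carry meaning. Applied to the notebook, the analogous check would be pairing `pca_ng.components_[0]` with the attribute names.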
Using the t-SNE function from sklearn, plot the results of using 2 components with the rest of the parameters set to their defaults. Set the random state to 2022.
Be sure to plot in the same way as you did in 2.1 and 2.2 (using the same color indications, tooltips, etc.)
What relationships do you see in the t-SNE output?
# code here
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2,random_state=2022)
X_TSNE = tsne.fit_transform(X)
fifa["TSNE_First"] = X_TSNE[:,0]
fifa["TSNE_Second"] = X_TSNE[:,1]
alt.Chart(fifa, title="TSNE FIFA").mark_point().encode(
    x="TSNE_First",
    y="TSNE_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
tsne_ng = TSNE(n_components=2,random_state=2022)
X_TSNE_NG = tsne_ng.fit_transform(X_NG)
fifa_ng["TSNE_NG_First"] = X_TSNE_NG[:,0]
fifa_ng["TSNE_NG_Second"] = X_TSNE_NG[:,1]
alt.Chart(fifa_ng, title="TSNE FIFA (Non Goalkeeper)").mark_point().encode(
    x="TSNE_NG_First",
    y="TSNE_NG_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
The first t-SNE component appears to track attacking ability: the attackers (strikers, wingers, etc.) are on the left side of this graph.
The second t-SNE component in the first graph appears to track position on the field. Players who play toward the front of the field have lower values, while those who play toward the back have higher values.
Now, do the same as above but only for central midfielders. This time, instead of indicating the broad position group in your chart (which would be "Midfielders" in this case), indicate the specific position through color, like "CAM", "CDM" and so on.
What relationships do you see in the t-SNE output?
# Keep only the central midfielders (CAM, CDM, CM)
mid_list = []
for i, row in tqdm(fifa.iterrows()):
    if row["Best Position"] not in ("CAM", "CDM", "CM"):
        mid_list.append(i)
fifa_mid = fifa.drop(mid_list)
#Prepare the data of midfielders
X_MID = []
Position_MID = []
for i, row in tqdm(fifa_mid.iterrows()):
    X_MID.append([row['Crossing'],
                  row['Finishing'],
                  row['HeadingAccuracy'],
                  row['ShortPassing'],
                  row['Dribbling'],
                  row['FKAccuracy'],
                  row['LongPassing'],
                  row['BallControl'],
                  row['Acceleration'],
                  row['SprintSpeed'],
                  row['Reactions'],
                  row['ShotPower'],
                  row['Stamina'],
                  row['Strength'],
                  row['LongShots'],
                  row['Aggression'],
                  row['Penalties'],
                  row['StandingTackle'],
                  row['GKDiving'],
                  row['GKHandling'],
                  row['GKKicking'],
                  row['GKPositioning'],
                  row['GKReflexes']
                  ])
    Position_MID.append(row["Best Position"])
500it [00:00, 25431.12it/s] 188it [00:00, 11545.93it/s]
# code here
tsne_mid = TSNE(n_components=2,random_state=2022)
X_MID_trans = tsne_mid.fit_transform(X_MID)
fifa_mid["TSNE_MID_First"] = X_MID_trans[:,0]
fifa_mid["TSNE_MID_Second"] = X_MID_trans[:,1]
alt.Chart(fifa_mid, title="TSNE FIFA (ONLY MID)").mark_point().encode(
    x="TSNE_MID_First",
    y="TSNE_MID_Second",
    color="Best Position",
    tooltip=["Name", "Best Position", "image"]
)
The CM players sit between the CAM and CDM players.
The second component appears to reflect defensive ability, while the meaning of the first component is less obvious.
Read this Distill on t-SNE parameters.
The perplexity parameter is described in the Distill as
"...which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has."
Create two plots, one where you select a perplexity such that the points appear more tightly clustered (many small, tight clusters), and another where they are less tightly clustered (larger, less clear clusters). The default perplexity is 30, so consider this the baseline. Be sure to set the random seed to 2022 and use all of the data.
# code here
tsne_less_pp = TSNE(n_components=2,perplexity=5,random_state=2022)
X_less_pp = tsne_less_pp.fit_transform(X)
fifa["TSNE_LESS_PP_First"] = X_less_pp[:,0]
fifa["TSNE_LESS_PP_Second"] = X_less_pp[:,1]
alt.Chart(fifa, title="tSNE-Less FIFA").mark_point().encode(
    x="TSNE_LESS_PP_First",
    y="TSNE_LESS_PP_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
# code here
tsne_greater_pp = TSNE(n_components=2,perplexity=50,random_state=2022)
X_greater_pp = tsne_greater_pp.fit_transform(X)
fifa["TSNE_GREAT_PP_First"] = X_greater_pp[:,0]
fifa["TSNE_GREAT_PP_Second"] = X_greater_pp[:,1]
alt.Chart(fifa, title="tSNE-Greater FIFA").mark_point().encode(
    x="TSNE_GREAT_PP_First",
    y="TSNE_GREAT_PP_Second",
    color="Position",
    tooltip=["Name", "Position", "image"]
)
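The two cells above differ only in the perplexity value, so the comparison can also be run as a loop. The following is a sketch on synthetic data (three Gaussian blobs, a hypothetical stand-in for the FIFA features) that fits t-SNE at a low, default, and high perplexity with the same random seed.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2022)

# Synthetic stand-in: three Gaussian blobs in 10 dimensions, 40 points each.
centers = rng.normal(scale=5.0, size=(3, 10))
X_demo = np.vstack([c + rng.normal(size=(40, 10)) for c in centers])

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must be smaller than the number of samples (120 here).
    tsne_demo = TSNE(n_components=2, perplexity=perplexity, random_state=2022)
    embeddings[perplexity] = tsne_demo.fit_transform(X_demo)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Each entry of `embeddings` can then be plotted in the same way as the charts above, which makes it easier to compare how the layout changes as perplexity grows.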
Answer the following questions:
(1) What do cluster sizes mean in t-SNE (e.g., one cluster with a large standard deviation vs. another with a tighter distribution)?
(2) Do distances between clusters or points mean something?
(3) What are some advantages of t-SNE over PCA?
(4) What are some disadvantages of t-SNE compared to PCA?
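(1) Cluster sizes in a t-SNE plot are not a reliable indicator of how spread out the points are in the original space. t-SNE adapts its notion of distance to the local density of the data, so it tends to expand dense clusters and contract sparse ones. A cluster that looks tighter in the plot therefore does not necessarily have a tighter distribution in the original feature space.
(2) Distances between nearby points are meaningful in a relative sense: t-SNE converts Euclidean distances in the original space into neighbor probabilities and tries to preserve them, so points that are close together in the plot are probably more similar to each other than to the rest. Distances between well-separated clusters are much less reliable, because they depend strongly on the perplexity and may not reflect distances in the original space.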
(3) t-SNE usually produces clearer visualizations than PCA. It is more likely to reveal distinct clusters because it focuses on the local structure of high-dimensional data.
(4) t-SNE is slow and computationally expensive on large datasets, and its parameters (e.g. distance, perplexity) are not as easily interpretable as those of PCA.
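A further practical difference, which is not part of the answer above and applies specifically to the scikit-learn implementations: a fitted PCA can project new rows with `transform`, whereas `TSNE` only offers `fit` and `fit_transform`, so adding players means refitting the whole embedding. A small check of that, as a sketch:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a linear map, so it can project data it has not seen before.
pca_has_transform = hasattr(PCA, "transform")

# scikit-learn's TSNE only offers fit and fit_transform.
tsne_has_transform = hasattr(TSNE, "transform")

print(pca_has_transform, tsne_has_transform)
```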